Exploiting structural information for semi-structured document categorization

Authors

  • Andrej Bratko
  • Bogdan Filipic
Abstract

This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
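The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three text classifiers or the meta classifier, so this example assumes a multinomial Naive Bayes model per field as the base learner and a small logistic regression as the meta classifier, for binary labels only. All names (`FieldNB`, `views`, `level0`, `fit_meta`, `predict`) are hypothetical. The flat text of all fields is added as one extra view, in the spirit of the abstract's last sentence.

```python
import math
from collections import Counter


class FieldNB:
    """Multinomial Naive Bayes over the tokens of one text view (labels 0/1)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_pos(self, text):
        """Return P(label = 1 | text) using Laplace-smoothed estimates."""
        scores = []
        for y in (0, 1):
            total = sum(self.counts[y].values()) + len(self.vocab) + 1
            s = math.log((self.docs[y] + 1) / (sum(self.docs.values()) + 2))
            for tok in text.lower().split():
                s += math.log((self.counts[y][tok] + 1) / total)
            scores.append(s)
        top = max(scores)  # subtract the maximum before exponentiating
        e0, e1 = math.exp(scores[0] - top), math.exp(scores[1] - top)
        return e1 / (e0 + e1)


def views(doc, fields):
    """One text per structural component, followed by the flat text."""
    parts = [doc.get(f, "") for f in fields]
    return parts + [" ".join(parts)]


def level0(train_docs, train_labels, docs, fields):
    """Train one base model per view and return its predictions for docs."""
    train_views = [views(d, fields) for d in train_docs]
    models = [FieldNB().fit([v[i] for v in train_views], train_labels)
              for i in range(len(fields) + 1)]
    return [[m.prob_pos(t) for m, t in zip(models, views(d, fields))]
            for d in docs]


def sigmoid(z):
    """Numerically stable logistic function."""
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))


def fit_meta(docs, labels, fields, folds=3, epochs=500, lr=0.5):
    """Fit logistic-regression meta weights on held-out level-0 predictions."""
    meta_x = [None] * len(docs)
    for k in range(folds):
        held = [i for i in range(len(docs)) if i % folds == k]
        rest = [i for i in range(len(docs)) if i % folds != k]
        preds = level0([docs[i] for i in rest], [labels[i] for i in rest],
                       [docs[i] for i in held], fields)
        for i, p in zip(held, preds):
            meta_x[i] = p
    w = [0.0] * (len(fields) + 2)  # bias, one weight per field, flat text
    for _ in range(epochs):
        for x, y in zip(meta_x, labels):
            err = y - sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x)))
            w[0] += lr * err
            for j, xj in enumerate(x):
                w[j + 1] += lr * err * xj
    return w


def predict(w, train_docs, train_labels, docs, fields):
    """Classify docs; the base models are refit on all training data."""
    feats = level0(train_docs, train_labels, docs, fields)
    return [int(w[0] + sum(a * b for a, b in zip(w[1:], x)) > 0) for x in feats]
```

In this sketch a document is a plain dict such as `{"subject": "...", "body": "..."}`. Calling `fit_meta(docs, labels, ["subject", "body"])` returns the meta weights, which `predict` then applies to new documents. The meta classifier is trained on predictions for held-out folds so that it does not simply learn to trust base models that have already seen the training labels.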


Similar resources

Exploiting Structural Information in Semi-structured Document Classification

We investigate methods for exploiting structural information in semi-structured documents in order to improve classification performance of the popular Naive Bayes text classifier. A novel method based on natural language modeling is introduced which effectively combines the expressive power of a structure-aware classifier with more reliable parameter estimation of the flat-text model. We provid...


Weighted Naive Bayes Model for Semi-Structured Document Categorization

The aim of this paper is the supervised classification of semi-structured data. A formal model based on Bayesian classification is developed while addressing the integration of the document structure into classification tasks. We define what we call the structural context of occurrence for unstructured data, and we derive a recursive formulation in which parameters are used to weight the contri...


Semi-Structured Document Classification

Introduction: Document classification has developed over the last ten years, using techniques originating from the pattern recognition and machine learning communities. All these methods operate on flat text representations in which word occurrences are considered independent. The recent paper (Sebastiani, 2002) gives a very good survey of textual document classification. With the development of st...


Book Recommending Using Text Categorization with Extracted Information

Content-based recommender systems suggest documents, items, and services to users based on learning a profile of the user from rated examples containing information about the given items. Text categorization methods are very useful for this task but generally rely on unstructured text. We have developed a book-recommending system that utilizes semi-structured information about items gathered from...


Detecting Data and Schema Changes in Scientific Documents

Data stored in a data warehouse must be kept consistent and up-to-date with respect to the underlying information sources. By providing the capability to identify, categorize and detect changes in these sources, only the modified data needs to be transferred and entered into the warehouse. Another alternative, periodically reloading from scratch, is obviously inefficient. When the schema of an i...



Journal title:
  • Inf. Process. Manage.

Volume 42  Issue 

Pages  -

Publication date 2006